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Abstract 

Background: Human Malaria is transmitted by mosquitoes of the genus Anopheles. Transmission is a complex 
phenomenon involving biological and environmental factors of humans, parasites and mosquitoes. Among more 
than 500 anopheline species, only a few species from different branches of the mosquito evolutionary tree transmit 
malaria, suggesting that their vectorial capacity has evolved independently. Anopheles olbimonus (subgenus 
Nyssorhynchus) is an important malaria vector in the Americas. The divergence time between Anopheles gambiae, 
the main malaria vector in Africa, and the Neotropical vectors has been estimated to be 100 My. To better 
understand the biological basis of malaria transmission and to develop novel and effective means of vector control, 
there is a need to explore the mosquito biology beyond the An. gambiae complex. 

Results: We sequenced the transcriptome of the An. albimanus adult female. By combining Sanger, 454 and 
lllumina sequences from cDNA libraries derived from the midgut, cuticular fat body, dorsal vessel, salivary gland and 
whole body, we generated a single, high-quality assembly containing 16,669 transcripts, 92% of which mapped to 
the An. darlingi genome and covered 90% of the core eukaryotic genome. Bidirectional comparisons between the 
An. gambiae, An. darlingi and An. albimanus predicted proteomes allowed the identification of 3,772 putative 
orthologs. More than half of the transcripts had a match to proteins in other insect vectors and had an InterPro 
annotation. We identified several protein families that may be relevant to the study of P/asmod/L/m-mosquito 
interaction. An open source transcript annotation browser called GDAV (Genome-Delinked Annotation Viewer) was 
developed to facilitate public access to the data generated by this and future transcriptome projects. 

Conclusions: We have explored the adult female transcriptome of one important New World malaria vector. An. 
albimanus. We identified protein-coding transcripts involved in biological processes that may be relevant to the 
Plasmodium lifecycle and can serve as the starting point for searching targets for novel control strategies. Our data 
increase the available genomic information regarding An. albimanus several hundred-fold, and will facilitate 
molecular research in medical entomology, evolutionary biology, genomics and proteomics of anopheline 
mosquito vectors. The data reported in this manuscript is accessible to the community via the VectorBase website 
(http://www.vectorbase.org/Other/AdditionalOrganisms/). 
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Background 

Human malaria transmission is dependent on efficient 
development of Plasmodium parasites within anopheline 
mosquito vectors. Anopheline mosquitoes are a large 
subfamily comprising nearly five hundred species distrib- 
uted in subtropical and tropical areas around the world, 
but only a small percentage (10-20%) are malaria vectors 
[1]. Intriguingly, malaria infection rates among anophel- 
ine species do not correlate with mosquito phylogenetic 
relationships, suggesting that genetic traits associated 
with vectorial capacity have quickly and independently 
evolved in different species [2]. 

Vectorial capacity is a highly complex biological 
phenomenon depending on mosquito behavior, lifespan 
and innate refractoriness or susceptibility to Plasmo- 
dium infection, which may result from the co- 
evolutionary forces driving the tripartite interaction be- 
tween humans, mosquitoes and parasites [3]. As a result 
of the sequencing of the Anopheles gambiae and Plasmo- 
dium falciparum genomes, a great deal of information 
regarding our understanding of mosquito-pathogen 
interactions at a molecular level has been gained [4,5]. 
Post genome research has highlighted the role of the An. 
gambiae innate immune response in determining mos- 
quito refractoriness to Plasmodium [6-8]. 

In addition to the mosquito's innate ability to transmit 
malarial parasites, critical aspects of mosquito biology, 
like adaptation to diverse niches, host seeking behavior, 
and resistance to insecticides are still unknown. Novel 
strategies for control, based on a deep understanding of 
mosquito biology and evolution, will be required to 
achieve the goal of eventual malaria eradication. Rapid 
technological advances in DNA sequencing, protein 
characterization by mass spectrometry and bioinformat- 
ics offer unique opportunities to generate large catalogs 
of genes, proteins and biological networks that may en- 
able the identification of potential mosquito control tar- 
gets beyond the An, gambiae-P, falciparum dyad [9], 
which can be harnessed by novel strategies such as 
transgenesis, mosquito-based transmission blocking vac- 
cines [10,11] and alternative insecticides [12]. 

Plasmodium vivax malaria still represents a major 
health and socio-economic burden in Asia, the Western 
Pacific and the Americas [13,14]. Malaria in the Ameri- 
cas is transmitted by several anopheline species, includ- 
ing species belonging to the Nyssorhynchus subgenus, 
which is unique to the New World. Anopheles (Nyssor- 
hynchus) albimanus is a major vector in southern Mex- 
ico, Central America and the northern region of South 
America [15,16]. The Nyssorhynchus subgenus is 
thought to be the earliest diverged branch of the anoph- 
eline radiation, which probably occurred more than one 
hundred million years (My) ago when the supercontin- 
ent Gondwana separated to give rise to the actual South 



American and African continents [17]. The proposed in- 
dependent emergence of vectorial traits in conjunction 
with their rapid evolution may imply different molecular 
strategies involved in P, vivax and P, falciparum refrac- 
toriness and susceptibility in this subgenus. However, 
very little molecular information exists for any of the 
New World anopheline vectors, except for Anopheles 
(Nyssorhynchus) darlingi whose genome draft was re- 
cently released [18]. The An, albimanus genome is 
scheduled for genome sequencing in the near future [2]. 

We describe herein the results of a gene discovery based 
cDNA sequencing project combining conventional Sanger 
with Next Generation Sequencing (NGS) platforms to 
analyze cDNA samples derived from An, albimanus tissues 
including the midgut, cuticular fat body, dorsal vessel and 
salivary gland, which are critical organs involved in the 
Plasmodium life cycle [6,19,20] (Hernandez-Martinez, Un- 
published observations). The main objective of our study 
was to construct a reference transcriptome that will facili- 
tate molecular and applied studies of An, albimanus refrac- 
toriness to Plasmodium infection and other biological 
processes relevant for disease transmission. Finally, to 
maximize accessibility for this and future transcriptome se- 
quencing projects in the absence of a genome sequence, we 
developed an open-source sequence annotation browser 
called GDAV (Genome-Delinked Annotation Viewer; 
http://funcgen.vectorbase.org/gdav). The An, albimanus 
transcriptome annotations are available via the VectorBase 
website (http://www.vectorbase.org/Other/AdditionalOr- 
ganisms/). 

Results and discussion 

The An, albimanus transcriptome assembly 

To capture as much of the transcriptome of the tissues 
involved in the interaction with Plasmodium spp, as pos- 
sible, we combined transcriptome data generated with 
Sanger, 454 and lUumina sequencing platforms from sev- 
eral different sources of adult female An, albimanus 
RNA. Table lA describes the origin of the RNA used to 
generate cDNA libraries sequenced by the different plat- 
forms, as well as the read number contribution of each. 
Owing to the inherent differences in throughput of each 
of the sequencing platforms used, most of the final data- 
set contains Illumina reads derived from the midgut 
transcriptome (94%), and nearly half of the transcripts 
were built with at least one 454 read. Only 6% of the 
built transcripts contained one or more Sanger reads 
(Table IB). There were 8,958 (54%) transcripts expressed 
exclusively in the midgut, 35 (0.2%) were derived specif- 
ically from cuticular epithelium/fat body, and 80 (0.47%) 
were expressed only in the dorsal vessel, whereas the 
rest (45%) were found in two or more tissues (Figure 1). 

An initial transcriptome assembly was performed by 
combining all sequence reads described in Table lA, 
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Table 1 Libraries, sequencing metrics and assembly 



A: Tissue and sequencing platform 


Tissue source lllumina (reads) 


454 


Sanger 




(reads) 


(reads) 


Midgut 210x10^ 


62,703"'^ 


1,017 


Abdominal cuticle^ 


111,566^ 


605 


Dorsal vessel 


233,984** 




Salivary gland 




1,038 


Whole female*" 




2,121 


Totals 210x10^ 


408,253 


4,781 


B: Transcript composition 




Transcripts 


(%) 


Transcripts with lllumina reads 


15,764 


94.4 


Transcripts with 454 reads 


CO 


48.3 


Transcripts with Sanger reads 


1,051 


6.3 


C: Transcript assembly summary 


Total transcripts 




16,699 


Total bases (Mbp) 




16.1 


Mean contig length (bp) 




970 


N50 (bp) 




1,434 



^ Includes abdominal fat body. 
^ Inoculated with 5. marcescens. 

Fix. 250 bp read length. 

Titanium 400 bp read length. 
® P. vivax infected and non-infected. 



which generated 15,764 contigs. Re-mapping the Sanger 
and 454 reads to the initial assembly revealed that a sub- 
stantial number of 454 (53%) and Sanger reads (44%) 
were not represented in the initial transcriptome assem- 
bly, so they were re-assembled using GS Assembler, 




Figure 1 Origin of transcripts according to mosquito organ. 

Distribution of transcript number (n = 16,572) according to the 
mosquito organ. The area of each circle is roughly a proportional 
approximation of the number of transcripts. A marked bias towards 
midgut is noted due to the enormous throughput of lllumina 
sequencing of a midgut library. Libraries sequenced by Sanger were 
left out given the low throughput of Sanger sequencing and for 
simplicity. Thus, not all the 16,669 transcripts are represented. 

v J 



generating an additional set of 935 contigs. The total 
transcript dataset included 16,699 contigs with a mean 
contig length of 970 bp and a N50 of 1,434 bp 
(Table IC). Our assembly metrics closely resemble the 
recently published NGS and Sanger data on the An. 
funestus transcriptome, which yielded 15,527 contigs 
with a N50 of 1,753 bp [21]. Similar results were also 
reported in de novo transcriptome assembly studies for 
the planarian worm, Schmidtea mediterranea in which 
454 reads were pre-assembled to serve as scaffolds for 
lllumina paired-end assembly, yielding 17,465 contigs 
with a N50 of 1.6 kb [22], and 18,619 contigs with a 
mean length of 1,118 bp [23]. 

As a surrogate approach to estimate the contribution 
of each sequencing platform to the assembly quality, the 
transcript dataset was split into two subsets according to 
whether transcripts were assembled with only lllumina 
reads or with lllumina and 454 or Sanger reads. The 
Illumina-only subset was composed of 8,245 transcripts 
with a mean contig length of 822 bp and a N50 of 1,111 
bp. The composite subset (8,445 transcripts build up 
with lllumina + 454 or Sanger) had a mean contig length 
of 1,087 bp and a N50 of 1,648 bp (Figure 2A), which is 
similar to the recently published Aedes albopictus tran- 
scriptome using 454 sequencing [24]. We then evaluated 
if there were differences in the proportion of homolog 
proteins in the An, gamhiae predicted proteome using 
BLASTX (e value of 1.0 E'^) between both subsets. We 
did not observe a difference in the proportion of tran- 
script matches between the Illumina-only subset and the 
composite ones (51% and 54%, respectively). However, 
the An. gamhiae protein length coverage of translated 
transcripts was considerably improved in those tran- 
scripts belonging to the composite subset (Figure 2B) 
since 53% of their transcripts covered more than 70% of 
the An. gamhiae target, whereas only 25% of the 
Illumina-only transcripts had an equivalent An. gamhiae 
target coverage. 

Genome mapping results to J\n, gambiae and An. darlingi 

Since there is no genome sequence available for An. 
albimanus, unambiguous sequence alignment of An. 
albimanus transcripts to a reference anopheline genome 
could provide additional measures of the transcriptome 
assembly accuracy and completeness. Moreover, it could 
provide the means for refining reference genome anno- 
tation and provide evidence of functionally conserved 
genes that were missed by current gene finding algo- 
rithms [25]. The final cDNA assembly was mapped to 
the An. gambiae (100 My divergence, subgenus Cellia) 
and An. darlingi genomes (closer relative from subgenus 
Nyssorynchus). The current version of the An. gambiae 
PEST strain genome (AgamP3.6) is 278.2 Mbp long and 
it is in an advanced stage of assembly (oriented scaffolds) 
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Transcript length (bp) Coverage {%) # transcripts per An.gambiae gene 

Figure 2 Transcript length and protein length coverage of the An. albimanus transcriptome. The whole transcriptome was split in two 
subsets according to transcripts composed of lllumina reads only or lllumina plus either of the two other platforms (composite subset). To make 
datasets comparable, their frequency (y axis) was expressed as a percentage. A) Transcript length distribution for both subsets and the whole 
dataset reveals slightly longer transcripts in the composite subset. B) An. gambiae protein length coverage obtained by BLASTX with An. 
olbinnonus transcripts. A larger fraction of the lllumina only subset covers less than one third of its respective match in An. gombloe. C) Frequency 
of mapped An. olbinnonus transcripts per An. gombloe gene. Transcripts were mapped to the An. gombloe genome using Exonerate v 2.2 [79] in 
the EST to genome mode (DNA vs DNA). 



[26] and annotation (13,320 predicted genes) [27]. The 
An, darlingi genome was sequenced using 454 sequen- 
cing and is in an early stage of assembly (3,990 un- 
oriented contigs) and annotation (11,430 predicted pro- 
tein coding genes) [18]. Using transcript to genome 
DNA alignment {see methods), 15,441 An, albimanus 
transcripts (92%) aligned (Exonerate bestn score >300) 
to the An, darlingi genome. Expressed as aligned base 
pairs, 13.8 of 16.1 Mbp (85%) of the An, albimanus tran- 
scriptome aligned to the An, darlingi genome (Table 2). 
Transcriptome mapping with the same parameters to 
the An, gambiae genome resulted in 9,648 aligned con- 
tigs (58%) with 7.1 Mbp (46%) aligned. Contigs that 
aligned to An, darlingi but not to the An, gambiae gen- 
ome were predominantly short contigs (0.3-1.5 Kb) 
(Table 2. Figure 3A). As expected from the overall tran- 
scriptome aligned fraction, the fraction of An, albimanus 
transcript sequences that aligned to An, darlingi 



Table 2 Transcript mapping to Anopheles genomes^ 



Reference genome 


An. gambiae 


An. darlingi 


Unique alignments 


9,648 




15,441 92 


Aligned bases (Mbp) 


7,3 


46 


13,8 85 


Number of transcript 
with introns 


5,438 


33 


8,365 50 


Transcripts mapping to genes 


6,305 




ND 


Single transcript per An, 
gambiae gene 


3,949 


62^ 


ND 


Single lllumina-only per An, 
gambiae gene 


2,707 




ND 


Single composite per An, 
gambiae gene 


3,386 


80'' 


ND 



^ Mapping was done using Exonerate v 2.2. using the EST2genome mode 
(DNA transcript aligned to genomic DNA)[79]. 

^ Numbers in bold letters indicate the fraction (%) of /An. albimanus dataset 

Percentage of total mapped. 
^ Percentage of corresponding subset. 



(transcript coverage) was considerably higher than for 
the An, gambiae genome, such that 76% of An, albima- 
nus transcripts that were mapped to the An, darlingi 
genome aligned with more than 90% of their respective 
length (Figure 3B). Our mapping results using An, dar- 
lingi as the reference genome are significantly better 
than those described for the planarian worm S, mediter- 
ranean which were mapped to its own genome [23], sug- 
gesting that despite its early stage of annotation, the An, 
darlingi genome is an appropriate surrogate genome for 
An, albimanus, and supports the accuracy and quality of 
our assembly. Transcripts that did not map to the An, 
darlingi genome (8%), as well as partial alignments may 
represent mis-assemblies in the transcriptome or the 
genome, rapidly evolving genes or the rapid evolution of 
untranslated regions (UTRs). Conversely, poor mapping 
to the An, gambiae reference genome may be result of 
all the above, further confounded by the increased evo- 
lutionary distance separating the two species. 

An analysis of the protein length coverage for An, albi- 
manus-An,gambiae BLASTX alignments showed a consid- 
erable proportion of transcripts with only partial coverage 
of their corresponding An, gambiae match (Figure 2B), in- 
dicating the possibility of multiple hits per An, gambiae 
gene. The extent of multiple transcripts mapping to a single 
An, gambiae gene was estimated by comparing transcript 
to genome and An, gambiae gene coordinates. Anopheles 
albimanus transcripts mapped partially or fully within the 
sequences of 6,305 An, gambiae genes, and 62% of these 
transcripts did so within a single An, gambiae gene (Table 2. 
Figure 2C). Based on our protein length coverage observa- 
tions (Figure 2B), we noted a higher proportion of single 
gene hits in the composite subset (80%) than in the 
lUumina-only subset (68%) (Figure 2C). 

As an additional approach to estimate transcript com- 
pleteness, we used reference genome mapping to analyze 
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— Agamp3 
^—An. darlingi (Ref) 
An. gambiae (Ref) 



Transcript length (bp) Transcript sequence coverage (%) ftExons 

Figure 3 Reference genome mapping of the An. albimanus transcriptome. Transcripts were mapped to the An. gambiae (blue line) or the 
An. darlingi (red line) genomes using Exonerate v 2.2 as in Figure 2C Mapped transcripts are expressed as a percentage of the total An. 
albimanus dataset. A) Aligned length plot; B) Fraction of the An. albimanus transcripts that aligned to either genome; C) Putative exon number 
relative to the proportion of An. albimanus mapped transcripts. For comparison, distribution of exon frequency in An. gambiae is shown (dotted 
black line). 



exon coverage. As expected from the phylogenetic rela- 
tionships, transcript mapping to the An. darlingi gen- 
ome showed higher average exon coverage (2.1 exons 
per transcript) than to the An. gambiae genome (1.9 
exons per transcript), which falls below the 4.4 exons 
per transcript in An. gambiae [28] (Figure 3C). However, 
there was no proportional difference in the exon cover- 
age using An. gambiae or An. darlingi relative to the 
number of significant alignments (data not shown). 
More than half (54.1%) of our mapped transcript dataset 
covers at least two exons, whereas in An. gambiae, more 
than 86% of the genes are composed of at least two 
exons (Figure 3C). Only ten and thirty-one transcripts 
that mapped to An. gambiae and An. darlingi, respect- 
ively, had more than 10 exons, compared to 742 for An. 
gambiae. The An. albimanus transcript that covered 
most exons (23) when mapped to the An. darlingi gen- 
ome was Locus_3073_Length_3,997, and corresponds to 
the An. gambiae AGAP000009 homolog, which is pre- 
dicted to have 25 exons (data not shown). 

In summary, transcript mapping to reference genomes 
and the derived analysis of exon structure of our tran- 
script dataset revealed a degree of incompleteness when 
using An. gambiae as our reference. The increased pro- 
portion of genome alignments without a spanning intron 
(single exon transcripts) observed in An. albimanus 
could result from incorrect splitting of the transcript 
during the assembly of transcriptional units having more 
than one exon. Correct exon representation is relevant 
because alternative splicing is a means to increase prote- 
ome diversity and this phenomenon has been observed 
frequently among dipterans [29,30]. Incomplete exon- 
exon structure across An. albimanus transcripts could 
underestimate the diversity of protein configurations 
and thus, may limit protein identification by proteomic 
approaches in the absence of the genome, and empirical 
studies to assess this possibility are required to fill in this 
gap in knowledge. Despite the limitations in our dataset. 



information regarding exon-exon structure may be use- 
ful for experimentalists when designing primers and 
probes for one gene-targeted analysis. 

Estimated proteome coverage 

As a starting point for transcript annotation, the propor- 
tion of the An. albimanus transcriptome that was hom- 
ologous to a predicted protein sequence in other 
genomes was analyzed. Protein similarity to other insect 
proteomes and the NCBI nr databases was assessed 
using BLASTX (using an e-value threshold of l.OE'^). A 
total of 10,000 sequences (62%) in our dataset had a sig- 
nificant match with at least one species. However, we 
note that probably due to methodological differences, 
this proportion is lower than the 84% match described 
by Crawford, when the An. funestus transcriptome was 
compared with the An. gambiae proteome [21]. 

Contrary to what would be expected based on degree 
of evolutionary relationships, and the results observed in 
transcript to genome alignments, we observed that a 
higher proportion of An. albimanus transcripts (56%) 
had a match with An. gambiae than with An. darlingi 
(54%) (Table 3. Figure 4B). However, a comparison be- 
tween An. darlingi and An. gambiae revealed that 80% 
of the An. darlingi proteins matched the An. gambiae 
proteome. This can be partially explained because the 
An. albimanus predicted proteome is considerably larger 
(16,699) than the An. darlingi predicted proteome. 

Considering the total amount of protein coding genes in 
each of the sequenced vector genomes, 51% oiAn. gambiae 
and 48% of the An. darlingi protein coding genes matched 
at least one transcript in An. albimanus, whereas An. dar- 
lingi covered 62% of the An. gambiae proteome. Proteome 
coverage tended to decrease according to phylogenetic dis- 
tance (Figure 4B). However, the proportion of significant 
best hits decreased according to phylogenetic relatedness, 
so that protein percent identity among best BLASTX hits 
was the highest with An. darlingi (average identity 85%), 
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Table 3 BLASTX comparisons to other Insect predicted proteomes 





An. cilbinicinus 
matchs 


% 


Average identity SD 

% 


Median Identity 

(%) 


Reference proteome 
size 


Reference Proteome coverage 

(%) 


An, darlingi 


9,116 


54.6 


85.8 


17.1 


92.0 


11,764 


48.4 


An, gambiae 


9,445 


56.6 


76.3 


17.8 


80.3 


12,670 


51.7 


Ae, aegypti 


9,030 


54.1 


69.5 


18.4 


71.7 


15,988 


40.3 


C, 

quinquefasciatus 


8,936 


53.5 


68.8 


17.1 


70.9 


18,883 


33.6 


D, melanogaster 


8,069 


48.3 


57.5 


17.9 


57.0 


13,804 


41.9 


1, scapularis 


6,446 


38.6 


49.7 


16.6 


47.4 


20,486 


21.7 


P, humanus 


7,396 


44.3 


53.4 


17.7 


51.4 


10,783 


46.0 



and decreased according to phylogenetic distance (Table 3. 
Figure 4A-B). 

The lower number of transcript matches observed be- 
tween An. albimanus and An. darlingi than between An. 
albimanus and An. gambiae is lil<ely the consequence of 
both the incompleteness of the An. albimanus dataset due 
to sampling of only a limited set of tissues from adult mos- 
quitoes, and the current incomplete assembly status of the 
An. darlingi genome (Figure 4B). Moreover, the An. gam- 
biae genome annotation relied heavily on homology based 
annotation approaches [5], and thus is very conservative. 
This implies that rapidly evolving genes may escape identi- 
fication by such conservative approaches. Combining con- 
servative and ab initio gene prediction strategies increased 
considerably the amount and quality of the An. gambiae 
protein coding genes [25]. Given the fact that most of the 
An. albimanus transcripts mapped to the An. darlingi gen- 
ome, many transcripts may in fact be derived from true 
protein coding genes. Additionally, many of the sequences 
could represent non-protein coding transcripts of potential 
biological significance [31]. 

Orthologs 

The proportion of the core eukaryotic genome covered 
by the An. albimanus transcriptome was investigated by 



searching for the 458 core eukaryotic protein models 
[32] in the An. albimanus predicted proteome, as well as 
the predicted proteomes of An. darlingi^ An. gambiae, 
Ae. aegypti, C. quinquefasciatus D. melanogaster, L sca- 
pularis, P. humanus and Rhodnius prolixus. As 
expected, we observed almost complete coverage for D. 
melanogaster (99%) and An. gambiae genomes (98%). 
We identified 415 core eukaryotic genes (CEGs) in the 
An. albimanus dataset corresponding to 90% coverage. 
Coverage ranged from 95-98% in the other genome 
sequenced species. The lowest coverage was observed 
for An. darlingi (88%), which as discussed in the previ- 
ous section, may reflect the early stage of that sequen- 
cing project (Table 4). 

An important goal of the Anopheles genome cluster 
will be to define a "core anopheline genome". Due to the 
incompleteness of the predicted proteome for An. albi- 
manus, we reasoned that an An. albimanus - An. gam- 
biae comparison could be very inaccurate in terms of 
ortholog prediction. However, using An. darlingi as an 
additional species for bidirectional comparisons, specifi- 
city could be increased but at the expense of sensitivity. 
To identify putative orthologs we used BLASTX and 
TBLASTN for bidirectional comparisons among the An. 
gambiae. An. darlingi and An. albimanus proteomes. 
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0 



10.0 60.0 
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Figure 4 Proteome comparison to other insect proteomes. The An. olbimonus transcriptome was compared to insect proteomes using 
BLASTX (e val. l.E"°^). A) Distribution of protein identity (%) of all high scoring pairs (HSP) for An. darlingi (Adar, red), An. gombioe (Agam, blue), Ae. 
aegypti (Aea, light green), C. quinquefasciatus (Cqf, dark green), D. melanogaster (Dmel, pink); /. scapularis (Isca, yellow) and P. humanus (Phum, 
orange). B) Average protein identity An. albimanus best hits to insect proteomes. Color codes for each species as in A. 
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Table 4 Proportion of the core eukaryotic genome 



Species 


CEGs (%) 


CEGs in 1:1:1 BRH 
dataset 


(%) 


CEGs in 1:1 BRH. ^n. 
albimanus 


(%) 


CEGs in 1:1 BRH. >An. 
gambiae 


(%) 


CEGs in 1:1 BRH.yAn. 
darlingi 


(%) 


An, albimanus 


415 90.6 


283 


62 


NA 


NA 


374 


82 


320 


70 


An, gambiae 


453 98.9 


283 


62 


374 


82 


NA 


NA 


361 


79 


An, darlingi 


403 88.0 


295 


64 


333 


73 


384 


84 


NA 


NA 


Ae, aegypti 


449 98.0 


ND 




ND 




ND 




ND 




C, 

quinquefasciatus 


440 96.1 


ND 




ND 




ND 




ND 




D, melanogaster 456 99.6 


ND 




ND 




ND 




ND 




/. scapularis 


440 96.1 


ND 




ND 




ND 




ND 




P, humanus 


456 99.6 


ND 




ND 




ND 




ND 





R,prolixus 439 95.9 ND ND ND ND 



Anopheles albimanus - An. darlingi^ An. albimanus - 
An. gambiae and An. darlingi - An. gambiae bidirec- 
tional comparisons revealed 5,029, 5,556 and 7,609 best 
reciprocal hits, respectively. The three species compari- 
son yielded a set of 3,772 1:1:1 putative orthologs (32% 
and 29% of the An. darlingi and An. gambiae predicted 
proteome, respectively) (Figure 5A). The 1:1 ortholog 
dataset between An. darlingi and An. gambiae com- 
prised 64% and 60% of their proteome, respectively. The 
proportion of orthologs between An. gambiae and D. 
melanogaster is 47 and 44%, respectively [28], and ran- 
ged between 73% (D. melanogaster -D. grimshawi) to 
93% (D. melanogaster-D. yakuba) within the Drosophila 
cluster [33]. 

To further validate the accuracy of our ortholog as- 
signment, we compared protein length coverage, using 
pair-wise alignments between translated An. albimanus 
proteins and their corresponding best match in An. gam- 
biae. A considerable improvement of protein length 
coverage was observed in orthologs (average protein 
length coverage was 74%), when compared to the 



protein coverage of the overall dataset (average protein 
length coverage of 53%). As described previously 
(Figure 2B), protein coverage displayed a bimodal distri- 
bution where 40% of the translated An. albimanus pro- 
ducts covered at least two thirds of their corresponding 
An. gambiae matches and another 40% covered less than 
one third (Figure 5B). A significant and expected im- 
provement was observed in the ortholog dataset, which 
was biased toward higher coverage, resulting in 66% of 
the orthologs covering more than two thirds of their 
corresponding An. gambiae matches (Figure 5B). 

If our BLAST best reciprocal hit ortholog prediction 
strategy was accurate, we would expect that all the core 
eukaryotic genes (CEGs) found would be represented in 
the 1:1:1 ortholog dataset. As shown in Table 4, only 283 
of the identified An. albimanus CEGs were included in 
the three species ortholog dataset, which corresponds to 
61% of the core eukaryotic genome. However, the pro- 
portion of CEGs found between two species best recip- 
rocal hits was higher and ranged from 70 to 83%. 
Together, our data indicate that although there is a 




Protein sequence cfwerage (%) 



Figure 5 Ortholog prediction in An. albimanus. Best reciprocal hits were identified by bidirectional BLASTX, BLASTP and TBLASTN comparisons 
between An. olbimonus transcriptome, An. darlingi and An. gombioe genomes. A) Venn diagram depicting the numbers of entries for each 
comparison. The area of each circle is a proportional approximation of the number of transcripts in each. B) Protein length coverage frequency 
distribution in total An. albinnanus translated transcripts matching the An. gombioe proteome (n = 9,445. Green line) and the 1:1:1 putative 
ortholog dataset (n = 3,772. Red line). 

V ) 
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relatively high coverage of the core eukaryotic genome, 
the anopheline ortholog assignment remains incomplete 
(i.e., it is missing a considerable amount of true ortho- 
logs), and may be inaccurate (i.e., includes a significant 
proportion of paralogs). 

Functional annotation of the An. albimanus transcriptome 

To gain insight into the predicted functional characteris- 
tics of the current dataset, translated products were 
functionally annotated using the Gene Ontology (GO) 
classification as implemented in the Blast2GO software 
[34], as well as using the InterPro classification [35]. A 
total of 5,086 transcripts were annotated according to 
biological process, and 5,792 transcripts were annotated 
according to molecular function, which corresponds 
roughly to one third of all the transcripts in the dataset 
for both categories, and to one half of the total BLASTX 
matches in the vectors' predicted proteomes (Table 3). 

To gain insight into the protein evolution rate accord- 
ing to biological process and molecular function, the 
annotated transcriptome dataset was partitioned accord- 
ing to the most abundant biological process and molecu- 
lar function GO slim categories. Based on BLASTX 
matches, average protein percent identity was estimated 
for each GO slim category, either in the 1:1:1 ortholog 
dataset, or as in all best hits (homologs) against the An. 
gambiae predicted proteome dataset (data not shown). 
For both the homologous and orthologous subsets, most 
conserved proteins belonged to general biological pro- 
cesses such as cytoskeletal organization, ion transport, 
translation and protein transport (Figure 6A), in agree- 
ment with other comparative studies in mosquitoes and 
D. melanogaster [28,36]. Putative homologous and 
orthologous proteins assigned as having stress response 
and transcription functions were the least conserved {P 

< 0.05), consistent with the notion of higher evolution- 
ary rates in genes involved in immune response 
[28,33,37,38], which are a subset of stress response 
genes. Although the average protein identity observed in 
the response to stress genes category was lower than the 
average protein identity in the entire annotated ortholog 
dataset, such differences were not statistically significant 
(P >0.05). Also, the average protein identity observed in 
the response to stress genes category was significantly 
higher relative to the average protein identity in the en- 
tire transcriptome (matching proteins in An. gambiae) {P 

< 0.5). This may be the result of the high proportion of 
genes present in our dataset that have no GO annotation 
and that, not surprisingly, are the least conserved 
(Figure 6A). 

According to molecular function, the best protein 
matches, those that were involved in translation, struc- 
tural functions, calcium and actin binding were the most 
conserved, which agrees with the initial biological 



process categorization. The least conserved putative pro- 
teins were categorized as electron carrier and proteases, 
whose conservation was significantly lower than that 
observed for the translation, actin binding and calcium 
binding (P < 0.05) ontologies. Proteases were signifi- 
cantly less conserved than the whole set of annotated 
predicted proteins (Figure 6B). Proteases have been 
reported as being subjected to positive selection in the 
Drosophila genus [36]. Also, trypsin family proteases 
have been significantly expanded in An. gambiae com- 
pared to D. melanogaster, suggesting faster evolution in 
anophelines that may be related to hematophagy [28,39]. 
Similarly, electron carrier activity is related to cyto- 
chrome function and is overrepresented in our dataset. 
The cytochrome P450 family, involved in insecticide re- 
sistance, is also expanded in An. gambiae [40] and may 
be also evolving rapidly in New World anophelines. 

The transcript dataset was also annotated using Inter- 
ProScan, which yielded 17,850 InterPro annotations. We 
noted that 7,154 (42%) transcripts have at least one 
InterPro annotation with an average of 2.4 annotations 
per annotated transcript. Annotation distribution was 
similar to An. gambiae (Table 5). The most abundant 
annotations, such as zinc-fingers (IPR007087), WD-40 
repeat (IPR001680), Protein kinase domain (IPR011009), 
Armadillo-like fold (IPR016024) and Serine/Cysteine 
proteases (IPR009003) and the general substrate trans- 
porters of the Major facilitator superfamily (IPR016196), 
were among the top most frequent annotations for both 
the An. albimanus transcriptome and the An. gambiae 
predicted proteome [41]. This similarity suggests that in 
general terms and despite being derived from a limited 
set of adult tissues, the An. albimanus dataset exhibits a 
similar representation to that of An. gambiae. However, 
certain InterPro annotations were more abundant in the 
An. albimanus dataset than in An. gambiae. For ex- 
ample, the C-terminal-like Glutathione S -transferase 
(GST) (IPR010987) was ranked in the 26th place versus 
79th in An. gambiae and contained almost the complete 
set (32 out of 38 in An. gambiae). Other highly ranked 
annotations that contained near-complete expected sets 
were E2 Ubiquitin-conjugating enzymes (IPR000608) (24 
of 26); and Peptidase Ml, membrane alanine aminopep- 
tidase, (IPR014782) (20 out of 25 in An. gambiae). Fi- 
nally, 41 An. albimanus transcripts were annotated as 
Protein of Unknown Function DUF227 (IPR004119), 
which ranked among the top 20 most abundant anno- 
tated transcripts, whereas in An. gambiae this classifica- 
tion was ranked in the 69th place with 43 annotated 
proteins. Considering the over-representation of midgut- 
derived transcripts in our dataset (Figure 1), aminopepti- 
dase enrichment is expected due to their proteolytic role 
in blood digestion [42]. GSTs and other detoxification- 
related gene expression is particularly enriched in the 
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Table 5 Top 50 InterPro Annotations in the An. albimanus transcriptome 



Rank in An. albimanus 


Transcripts 


Description 


InterPro ID 


Total hits 


Rank in An. gambiae 


1 


233 


Zinc finger, C2H2-type 


IPR007087 


2001 


1 


2 


194 


Protein l<inase-lil<e domain 


IPROl 1009 


204 


3 


3 


158 


Armadillo-type fold 


1 PRO 16024 


182 


4 


4 


127 


WD40 repeat 


IPR001680 


605 


5 


5 


106 


Serine/cysteine peptidase, trypsin-like 


IPR009003 


114 


2 


6 


98 


Major facilitator superfamily transporter 


IPR016196 


102 


10 


7 


90 


RNA recognition motif, RNP-1 


IPR000504 


327 


12 


8 


73 


Cytochrome P450 


IPROOl 128 


361 


16 


9 


68 


Thioredoxin-like fold 


IPROl 2336 


84 


24 


10 


57 


Ankyrin repeat 


IPR0021 10 


452 


19 


1 1 


57 


ATPase, AAA + type, core 


IPR003593 


65 


8 


12 


51 


Glycoside hydrolase, catalytic core 


IPROl 7853 


52 


28 


13 


48 


EF-HAND 2 


IPROl 8249 


113 


37 


14 


47 


Ras 


IPROl 3753 


48 


26 


15 


46 


Src homology-3 domain 


IPROOl 452 


194 


40 


16 


43 


Pleckstrin homology 


IPROOl 849 


98 


35 


17 


41 


Short-chain dehydrogenase/reductase SDR 


IPR002198 


133 


45 


18 


41 


Protein of unknown function DUF227 


IPR0041 19 


46 


69 


19 


40 


DEAD-like helicase, N-terminal 


IPR014001 


40 


20 


20 


38 


Ras GTPase 


IPROOl 806 


150 


26 


21 


38 


Leucine-rich repeat 


IPR00161 1 


93 


6 


22 


38 


Small GTPase, Rho type 


IPR003578 


45 


48 


23 


37 


Immunoglobulin E-set 


IPR014756 


48 


30 


24 


36 


Tetratricopeptide repeat 


IPROl 9734 


275 


31 


25 


36 


Immunoglobulin-like 


IPR0071 10 


58 


13 


26 


32 


Glutathione S-transf erase, C-terminal-like 


IPROl 0987 


56 


79 


27 


31 


PDZ/DHR/GLGF 


IPROOl 478 


122 


32 


28 


31 


BTB/POZ fold 


IPROl 1333 


61 


23 


29 


28 


Heat shock protein DnaJ, N-terminal 


IPROOl 623 


137 


87 


30 


28 


Concanavalin A-like lectin/glucanase 


IPR008985 


35 


43 


31 


28 


Homeodomain-like 


IPR009057 


35 


14 


32 


27 


General substrate transporter 


IPR005828 


27 


49 


33 


25 


Glucose/ribitol dehydrogenase 


IPR002347 


116 


51 


34 


25 


Cellular retinaldehyde-binding, C-terminal 


IPROOl 251 


96 


53 


35 


24 


Ubiquitin-conjugating enzyme, E2 


IPR000608 


85 


137 


36 


23 


Fibronectin type III domain 


IPR008957 


66 


38 


37 


23 


Histone-fold 


IPR009072 


44 


47 


38 


22 


Chitin binding protein, peritrophin-A 


IPR002557 


148 


22 


39 


21 


Immunoglobulin l-set 


IPROl 3098 


38 


17 


40 


20 


Peptidase Ml, membrane alanine aminopeptidase, N-terminal 


IPROl 4782 


67 


133 


41 


19 


Fibrinogen, alpha/beta/gamma chain, C-terminal globular 


IPR002181 


72 


57 


42 


19 


GPCR, rhodopsin-like superfamily 


IPROl 7452 


19 


25 


43 


17 


SH2 motif 


IPR000980 


108 


97 


44 


17 


Carboxylesterase, type B 


IPR002018 


32 


59 


45 


17 


7TM GPCR, rhodopsin-like 


IPR000276 


77 


21 


46 


16 


WW/Rsp5/WWP 


IPROOl 202 


100 


145 


47 


16 


Protease inhibitor 14, serpin 


IPR000215 


63 


117 
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Table 5 Top 50 InterPro Annotations in the An. albimanus transcriptome (Continued) 



48 


15 


C-type lectin fold 


IPR016187 


17 


75 


49 


15 


Protein-tyrosine phosphatase, receptor/non-receptor type 


IPR000242 


92 


NP' 


50 


14 


Peptidase Ml 4, carboxypeptidase A 


IPR000834 


48 


154 



Not present. 



anterior midgut of An. gambiae larvae [43], suggesting 
that detoxification may also be maintained in the adult 
midgut. E2 ubiquitin conjugating enzymes are part of a 
general housekeeping homeostatic mechanism [44], and 
to our knowledge, they do not play a particular role in 
midgut physiology. However, in D, melanogaster, up- 
regulation of the ubiquitin-conjugating enzyme homolog 
coded by vihar and the 26 S proteosomal subunit RPN9 
in response to dietary Bowman-Birk inhibitor (BBI) in- 
toxication has been described [45]. It can be speculated 
that overrepresentation of E2 ubiquitin conjugating 
enzymes in the mosquito midgut may be part of a gen- 
eral stress response to different xenobiotics. 

Structurally, the DUF227 domain overlaps with the 
SCOP Protein kinase-like (PK-like) superfamily 
(IPR011009). The presence of DUF227 containing pre- 
dicted homologs was investigated by querying electron- 
ically inferred orthology in the BioMart database [46] 
and found that DUF227 is present in a family of proteins 
well represented in dipterans, nematodes, and to a lesser 
extent in other insects, but absent in other invertebrates 
such as the sea urchin. The number of DUF227 contain- 
ing orthologs according to BioMart decreased with 



phylogenetic distance so that between An. gambiae and 
Ae, aegypti there were sixteen 1:1 orthologs and between 
An, gambiae and D, melanogaster or Pediculus humanus 
there were only nine and five, respectively. Using Vec- 
torBase and MozAtlas for microarray data mining to 
identif)^ significant changes in gene expression {P < l.E" 
^) for those An. gambiae genes containing the DUF227 
domain, we found predominant expression in the Mal- 
pighian tubules, midgut and head of adult mosquitoes, 
with expression levels higher in males than in females 
[47]. In the midgut, 34 out of 43 genes had a signifi- 
cantly modified expression pattern after a blood meal 
[42]. Five genes were enriched in hemocytes and another 
five were enriched in the carcass [48]. Two genes were 
induced during P. berghei midgut invasion [49]. 

Conversely, certain annotations that are abundantly 
represented in An, gambiae such as Insect cuticle pro- 
teins (IPR000618); Tropomyosin (IPR000533); 7TM 
chemoreceptor (IPRO 13604); Mitochondrial Rho-like 
(IPR013684); Olfactory receptor, Drosophila 
(IPR004117); Pollen allergen Poa plX/Phl pVI, C- 
terminal (IPR001778) were substantially scarce in our 
dataset (Table 5). As expected, olfactory receptors, 7TM 



A 

Cytoskeleton org 
Protein transport 
Ion transport 
Translation 
Behavior 
Signal transduction 
Protein modification 
Reproduction 
Amino acid metabolism 
Cell differentiation 
Cell death 
Transcription 
Response to stress 




B 

Calcium ion binding 
Actin binding 
Translation factor activity 
Stmctural molecule 
RNA binding 
Nucleotide binding 
Enzyme regulator 
Protein kinase 
Nucleoside-triphosphatase 
Transmembrane transporter 
Receptor 
Transcription factor 
Electron carrier 
Peptidase 




Figure 6 Protein identity according to Gene Ontology. An. olbimonus transcriptome was GO annotated using Blast2G0 and partitioned 
according to biological process or molecular function. A) Average percent identity of biological function classes with respect to An. gambiae in 
the ortholog dataset. Average protein identity of response to stress and transcription genes was significantly lower than cytoskeleton 
organization and protein transport genes P < 0.001. Kruskal-Wallis test and Dunn's multiple comparison test). B) Average protein identity of 
best hit matches with respect to An. gambiae categorized according to molecular function. Average identity of electron carrier function and 
Peptidases were significantly lower than calcium and actin binding, translation factor and structural function genes P < 0.001. Kruskal-Wallis 
test and Dunn's multiple comparison test). For A and B, red lines indicate the average protein identity of the corresponding GO annotated 
subset. Blue lines indicate overall protein identity. 
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chemoreceptors, and pollen allergen were underrepre- 
sented protein annotations since they are not expressed 
in the midgut [47]. Similarly, insect cuticle proteins that 
are involved in adult cuticle synthesis show a peak dur- 
ing metamorphosis and are under-expressed in adults 
[50]. 

Immunity related genes 

Mosquito immune responses are thought to play an im- 
portant role in influencing vectorial competence. Higher 
protein divergence in genes implicated in stress response 
(Figure 6) is in agreement with the observation of higher 
divergence rate in immunity related genes (IRGs) than 
in genes involved in other core cellular processes [33,36- 
38]. Thus, we searched for potential immunity related 
genes by comparing our transcript sequences to a manu- 
ally curated dipteran IRG dataset, which includes 385 
genes classified into 27 families implicated in recogni- 
tion, regulation, signal transduction and effector phases 
of the immune response [38,51]. We found 413 best 
BLASTX reciprocal matches between our dataset and 
the ImmunoDB, representing all 27 families. However, 
this number may be an overestimation of the true num- 
ber of IRGs in our dataset, because it includes a large 
number of structurally homologous proteins that may 
not be involved in insect immune responses (for ex- 
ample many proteases and tyrosine kinases). To refine 
our IRG search, we considered a rather conservative ap- 
proach that contemplated the putative orthology as 
described in the previous section (Figure 5A), as well as 
structural signatures derived from the InterPro annota- 
tion. Among our 3,772 ortholog 1:1:1 dataset, 79 ortho- 
log groups have matches in the ImmunoDB (Additional 
file 1) and 73 were consistently supported by InterPro 
annotation or other protein family classification system. 
This set was further increased by the inclusion of three 
unannotated genes of the Toll and Imd pathways {TOLL- 
PATHl, IMDPATH5 and IMDPATH8) with supporting 
structural evidence based on InterPro annotation. This 
82 IRG putative ortholog dataset exhibited 75% average 
identity to their corresponding An. gambiae proteins, 
which is very similar to the average identity of all An, 
albimanus: An, gambiae matches. However, average 
identity was lower than the average identity of GO slim 
annotated ortholog proteins (84%). 

Gene duplication has been a major evolutionary force 
shaping immunity related genes in dipterans [38]. Thus, 
the identification of putative IRGs in our dataset based 
solely on the inclusion of ortholog groups may be miss- 
ing putative IRGs. Thus, by considering InterPro annota- 
tions, we found representative matches in 25 out of 27 
families, representing the recognition, signaling, regula- 
tion and effector immune response processes. 



Among the recognition phase' genes, different patho- 
gen recognition molecules were identified such as six 
Peptidoglycan Recognition Proteins (PGRPs), including 
the putative PGRP-LD ortholog [52]; three 1-3 p-glucan 
binding proteins (BGBPs), including the GNBPB2 and 
GNBPAl orthologs [53]; six C-type lectins (CTLs), in- 
cluding the CTL6 ortholog [54] and 19 Fibrinogen 
Related Proteins (FREPs), including the An, gambiae 
FREP3 ortholog (Locus_25924_Length_958), but not An, 
gambiae FREP9 ortholog which has been implicated in 
the dinti'Plasmodium response [55] (Figure 7). We fur- 
ther identified six thioester containing proteins (TEPs) 
based on InterPro annotation (IPR009048) [56] that 
were missed by the orthology approach. Proteins con- 
taining Leucine-Rich Repeats (LRR) are highly abundant 
in metazoans and are involved in molecular recognition 
in a wide variety of biological processes. As mentioned 
in the previous section, we identified 38 transcripts con- 
taining LRR domains (IPR001611). However, mosquitoes 
possess a unique type of LRR-domain among proteins 
that are involved in immune responses [57]. Apart from 
the LRR, these proteins share structural features includ- 
ing a conserved pattern of cysteine residues and coiled- 
coil domains. We found one putative Leucine Rich Re- 
peat immune protein (LRIM) that displays structural fea- 
tures of the Long LRIM subfamily and two that show 
compatible features with the short LRIM subfamily, in- 
cluding the LRIM6 putative ortholog (Locus_11561_- 
Length_1221) [57]. 

The major immune response signal transduction path- 
ways in dipterans are the Toll and the Imd pathways. 
Gene members of these pathways tend to be better con- 
served in different mosquito species than genes impli- 
cated in the recognition or effector phases [38,58]. 
Although transcripts containing LRR or ToU/interleu- 
kin-1 receptor homology (TIR) domain (IPR000157) 
were found, a clear Toll receptor homolog was not. 
However, we found important members of the Toll path- 
way such as PELLE, MYD88 and TRAF6; the Imd path- 
way such as CASPAR, IKKl and IKK2 [38] and STAT 
pathway such as DOME, SOCS and STAT2 [59,60] (Add- 
itional file 1), as well as REL2 transcription factor [61]. 

Autophagy was originally described as a cellular re- 
sponse to starvation. However, it has recently been 
shown to be a critical process related to immune and 
stress responses, clearance of intracellular pathogens and 
damaged organelles, as well as cell survival. There are 
several genes involved in the induction of autophagy, 
autophagosome nucleation, autophagosome expansion 
and autophagosome recycling [62]. There are 20 An, 
gambiae entries in the ImmunoDB related to autophagy. 
Given the importance of all the mentioned processes in 
malaria transmission, we searched the ImmunoDB for 
orthologs and found 12. In the induction of the pathway. 
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Locus_25924_Length_958 

Anopheles albimanus 



Sequence: 



CTGTCTTGCG CACCAATGCG CGGTGACCCT CTCGAGGTCG AACCATTTCA GGAGCCGAAA ACCACGCTTA 70 

TGCTACAGAT TCTCAAACGG TTAGACGAGA TTGATCAACG AATAAGCCTC GTCGAGCGCA AGATACAGGG 140 

CAATCAGCAG ATTGTGGTGT CAACCATTAC GAGGCAACTG GGCGGCATCA GTGATAAGCT CGAGAGTATC 210 

AACAAAACAT TACGTCAAAA TGTGGGCACA TACTACGAAT CATGCGTTGA GGCACCCGAA ATGATGAGCG 280 

ATGTGTATAT GATCAGACCA CCAGCTTCGA GCGTGGCGCC CTTCAAGGCG TTCTGCGAGC AGCGCTACCA 350 

GCAAGGTGGC TGGATGCTTC TGATGCGTCG GTTCGATGGT TCGCTCAACT TTACGCGCAG CTGGACCGAC 420 

TATCGCGATG GGTTCGGTGA TGTAGCCGCT GAGCATTGGA TTGGGCTAGA GCGAATGCAC CGCATTACTA 490 



Alignments: 



B 



Species/assembly 

AgamP3 



Method 

exonerate est2genome 



Location 

3L 20791786-20792606 




Chromosome bands 
Contigs 

VectorBase gene 

Anopheles albino... 
%GC 



Gene Legend 



< AGAP011307-RA 
protein coding 



Locus_25924 Lenglh_958 (8443) 



20.792.000 20.792.200 
1.64 Kb 



20.791.400 20.791.600 20.791.800 

Revefse strand 
protein coding 
There are currently 64 tracks turned off. 

VectorBase Anopheles gambiae version 63.3 (AgamP3) Chromoson>e 3L: 20.791.376 - 20.793,017 



InterProScan hits 



translation [^^^^ 


analysis 
type 


analysis domain 
ID 


analysis description 


e- InterPro , ^ n . 

value ID InterPro description 


GO terms 


0RF1(+1) 292 


ProfileScan 


PS51406 


FIBRIN0GEN_C_2 


Fibrinogen. 

45.768 IPR002181 alpha/beta/gamma chain, C- 
terminal globular 


Molecular Function: receptor 
binding (00:0005102). 
Biological Process: signal 
transduction (00:0007165) 



BLASTX best hits to AgamP3.6 peptides 

Anopheles gambiae best hit %ldentity e-val Score Frame ^Coverage (subject) Subject length Description 

AOAP011307-PA 72 1e-77 285 +1 100 175 jhypotheticai protein|protein_codingl3L:20791869- 

A^AFunju/KA fz leffZQ^ +1 luu ut) 20792470:- llgene AGAPOl 1307 



BLASTX best hits to Anopheles darlingi peptides 



Anopheles darlingi best hit 


%identity 


aligned length 


#mismatches #gaps subject start 


subject end e-value bit score 


EFR20344 1 


79 06 


234 


49 2 75 


288 7e-104 372 



Figure 7 Genome de-linked annotation viewer. Screen shots of the An. albimanus transcript Locus_25924_Length_958 corresponding to the 
FREP3 ortholog. A) Sequence pane. B) Alignment pane. An exonerate alignment to the An. gambiae reference genome can be displayed via the 
Distributed Annotation System (DAS) in the VectorBase genome browser, by following the link. The lower panes correspond to annotations 
which include InterPro annotation (C), and BLASTX comparisons (D) to An. gambiae, An. darlingi, D. melanogaster, Ae. aegyptl, C. qulnquefasclatus, 
I. scapularls and P. humanus. For A, C and D only partial information is shown. 



only the TOR homolog was found. Similarly, the only 
homolog involved in the nucleation phase that we found 
was the BCL-2 homolog DEBCL {buffy in D. melanoga- 
ster) [63]. The remaining ten corresponded to genes 
involved only in autophagosome expansion such as 
APG3, APG4A, APG4B, APG7A, APG7B, APG8 and 
APG16 orthologs or autophagosome recycling such as 
APG2, APG9 and APG18 (Additional file 1). In contrast, 
we found a relative depletion of caspases (only CASPS7 



and CASPLl, of 14 caspases in An. gambiae) and the ab- 
sence of caspase activators {AKR and Michelob_x), sug- 
gesting that, as in larvae midgut [64], autophagy could 
be a more active homeostatic tissue process than apop- 
tosis in adult mosquitoes. 

Conclusions 

We have explored the adult female transcriptome of an 
important New World malaria vector. An, albimanus, by 
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sequencing cDNA libraries generated from different tis- 
sue sources related to the Plasmodium life cycle such as 
midgut, cuticular fat body, dorsal vessel and salivary 
gland. Merging Sanger and NGS data into a single as- 
sembly generated a robust dataset with adequate tran- 
script lengths that could be effectively mapped to the 
An, darlingi genome, covered 90% of the core eukaryotic 
genome and half of the predicted proteome of other 
mosquito vectors. We identified protein-coding tran- 
scripts involved in biological processes such as immune 
recognition, immune signaling pathways, insecticide re- 
sistance and autophagy that may be relevant to the Plas- 
modium cycle or may represent targets for novel control 
strategies. As a result of this work, the genomic informa- 
tion available for An, albimanus has increased several 
hundred-fold, thus providing molecular inputs for re- 
search in this species: 1) from a single gene perspective; 
2) to gain insight into the anopheline radiations in the 
New World; 3) facilitating further genomic and prote- 
omic approaches; and 4) assisting gene finding and val- 
idation of the An, albimanus genome in the context of 
the Anopheles cluster genome sequencing project [2]. 
Sequence information, predicted proteome comparisons, 
transcript mapping to the An, gambiae genome and 
InterPro annotations described in this manuscript are 
accessible to the community through the VectorBase 
website (http://www.vectorbase.org/Other/AdditionalOr- 
ganisms/). 

Materials and methods 

Mosquitoes and mosquito infections with P. y'wax 

All the mosquito samples used in the present work were 
3-5 day post-emergence female An, albimanus of the 
white-stripe strain [65] obtained from the insectary of 
the National Institute of Public Health (INSP) in Cuer- 
navaca, Mexico. Mosquitoes were fed with 10% sucrose 
ad libitum and reared in a 12:12 h light cycle maintained 
at 28 °C and 80% relative humidity. 

Mosquito infections with P, vivax CSP-VK210 and ex- 
traction of midgut epithelium of 24 h and seven days 
after an infectious blood meal were performed as 
described [66] in the insectary of the Centro Regional de 
Investigacion en Salud Publica (CRISP) in Tapachula, 
Chiapas, Mexico, according to Institute ethical guide- 
lines and approval. 

cDNA libraries for Sanger sequencing 

To capture transcripts from mosquito organs that are 
relevant for the Plasmodium sp life cycle, three cDNA li- 
braries were generated for Sanger sequencing: A 
sucrose-fed female midgut (site of invasion) and a saliv- 
ary gland (site of sporozoite maturation and inoculation 
to vertebrate host) cDNA libraries were constructed 
from 5 [ig of total RNA. cDNA was ligated to Uni-ZAP 



XR vector (Stratagene). The phage library was mass 
excised and plated into LB-agar plates. Individual col- 
onies were replicated in 384-well plates. A third cDNA 
library was constructed from 0.5 [ig of Poly A + RNA 
extracted from whole female An, albimanus 12 h after 
inoculation with 0.0004 OD of Serratia marcescens in 
the hemocoel, which elicits an immune response that 
limits the development of P, vivax [67] . cDNA synthesis 
and library construction was done using the Creator 
SMART cDNA Library Construction Kit (Clontech) 
according to the manufacturer instructions. The library 
was transformed in Escherichia coli by electroporation 
and plated in LB-Agar with chloramphenicol (30 (ig/ 
mL). Individual colonies were replicated in 384-well 
plates. 

Template preparation and Sanger sequencing 

E, coli clones were inoculated in CIRCLEGROW® 
(Krackler scientific) liquid media with either chloram- 
phenicol (pDNA-lib) or ampicilin (pBluescript) at 37 °C 
in 96-well plates for 16 h. Plasmid DNA was prepared 
by alkaline lysis in Millipore filters and ethanol- 
precipitated and suspended in sterile deionized water. 
Sequencing was performed with fluorescent dye termi- 
nators in a 3100 Genetic Analyzer (Applied Biosystems). 

EST processing 

Raw chromatogram files were quality assessed and 
trimmed with Phred using the trim_alt command with 
default parameters [68] and then converted to FASTA 
files using PH2FASTA. Vector sequence and linker 
sequences were removed using CrossMatch [69] and 
SeqClean [70]. Identification of mitochondrial and ribo- 
somal protein transcripts was done by BLASTn searches 
to mosquito mitochondrial genomes or ribosomal pro- 
tein sequence databases and filtered out. ESTs were sub- 
mitted to dbEST at NCBI (Accession Numbers 
EV406110.1 - EV410194.1). 

454 sequencing 

Abdominal cuticles and the underlying fat body were 
obtained from 20 female adult mosquitoes and total 
RNA was extracted with TRIzol (Life Technologies). In- 
tegrity of RNA was verified in the Agilent Bioanalyzer 
standard RNA chip. One (ig of RNA was used for full 
length RT-PCR amplification using the Super-SMART 
PCR cDNA synthesis kit (Clontech) according to the 
manufacturers instructions. The PCR amplified cDNA 
library was fragmented by nebulization and subjected to 
library preparation according to the 454 shotgun se- 
quencing protocol. After emulsion PCR titration and 
amplification, the library was sequenced in a full picoti- 
ter plate using the Genome Sequencer FLX platform. A 
similar approach was used to generate additional midgut 
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libraries from P. vivax infected- mosquitoes at 24 h post- 
infective blood meal (PIBM) and seven days PIBM, but 
sequenced as pooled bar-coded libraries in half picotiter 
plate. A third sequencing 454 run was performed with 
two cDNA libraries from dissected dorsal vessels, 
obtained at 18 h post-inoculation (intra-hemocoelic) 
with 0.25 \A of soluble fraction of zymosan (10 (ig glu- 
cose-equivalents/ml, Sigma) as described [71] and 
saline-inoculated mosquitoes respectively. Dorsal vessels 
were collected in RNAlater (Ambion) and stored at -80° 
C. After RNAlater removal, total RNA was extracted 
with the RNAeasy Kit (Qiagen) and amplified with the 
SMARTER Pico PGR cDNA synthesis Kit (Clontech), 
and sequenced in a full picotiter plate (one region per li- 
brary) using the GS FLX Titanium platform. Primer 
adaptors used for cDNA library generation were 
trimmed after signal processing using SeqClean. 454 se- 
quence data was submitted to the Sequence Read Arch- 
ive (SRA) (Accession number: SRA052091). 

Illumina sequencing 

Total RNA from 50 An, albimanus midguts was extracted 
using TRIzol, DNased and cleaned with an RNAeasy col- 
umn (Qiagen) according to the manufacturer s instructions. 
Total RNA was then quality controlled for integrity on a 
Bioanalyzer (Agilent Technologies). mRNA libraries were 
constructed and sequenced, as previously described [72-74] 
on a single lane of a Illumina HiSeq 2000, which generated 
-210 million 101 bp paired end reads. Illumina sequence 
data was submitted to the Sequence Read Archive (SRA) 
(Accession number: SRA051893). 

Assembly 

The entire Illumina read set was split into eight equal 
sized read sets. Each one of these Illumina read sets was 
merged with the 454 and Sanger data and assembled 
using the Velvet [75] and Oases [76] software packages 
using three different kmer sizes (43, 45 and 47). The 
resulting contigs for each assembly were run again 
through Velvet and Oases to produce a final assembly. 
We then filtered the final assembly to retain only those 
loci that contained a single transcript, that were longer 
than 300 bp, and that had confidence scores of 1.0. To 
address which contigs contained 454 or Sanger reads, all 
454 and Sanger reads were re-mapped to the initial as- 
sembly using the GS Reference Mapper v.2.5.3 using de- 
fault parameters. Unmapped reads were re-assembled 
using the GS assembler v2.5.3 on cDNA mode to yield 
an additional set of 935 contigs. 

Genome mapping 

The An, gambiae (AgamP3) [77] and An, darlingi gen- 
omes [18] were softmasked with RepeatMasker [78]. An, 
albimanus transcripts were aligned to either genome 



using Exonerate v. 2.2 [79] with the EST2 Genome 
mode, and a threshold score of 300, and maximum in- 
tron length of 20,000 bp. 

Transcript annotation 

Gene ontology annotations were performed using Blas- 
t2Go [34]. For the Initial BLASTX against the NCBI-nr 
database the command-line option "-el-e-6" was used. 
Additionally, transcripts were annotated according to 
the InterPro databases using InterProScan [35] in six- 
frame translation mode. Kruskal-Wallis test followed by 
Dunns correction was performed to calculate statistical 
differences within GO classes and protein percent iden- 
tity with the Graph Pad PRISM software. 

In silico proteome comparison 

The entire assembled transcript dataset was used to 
search for the best hit homologous proteins (BLASTX 
cut-off e-value l.OE'^) in the An, gambiae (AgamP3.6), 
Ae, aegypti (AaegLl.2); Culex quinquefasciatus 
(Cpipjl.2), P, humanus (PhumUl.2) and Ixodes scapu- 
lar is (IscaWl.l) predicted proteomes present at Vector- 
Base [77], as well as the An, darlingi [18] and D, 
melanogaster [80] proteomes. Ortholog prediction was 
done by performing BLASTX and TBLASTN bidirec- 
tional comparisons between An, albimanus, An, darlingi 
and An, gambiae (e value l.OE'^) to identify the best re- 
ciprocal hits within the three species. 

To identify the proportion of the core eukaryotic gen- 
ome covered by the An, albimanus transcriptome, we 
used HMM profiles corresponding to the 458 core 
eukaryotic proteins as provided by the CEGMA algo- 
rithm [32]. Local HMMER3 searches [81] were cali- 
brated using the An, gambiae core eukaryotic protein 
validated dataset consisting of 453 sequences [82]. 
HMMER3 was performed using hmmscan command 
and the "-T 40" and "— domT 40" filters against the An, 
albimanus predicted proteome, as well as the predicted 
proteomes of An, darlingi, An, gambiae, Ae, aegypti, C, 
quinquefasciatus and D, melanogaster, 

Web interface 

Many web-based genome browsers [83-85] are available 
as open source software and are well suited to displaying 
transcript annotations. However they are heavily 
dependent on the availability of a genome sequence to 
act as a coordinate system. It is possible to adapt gen- 
ome browsers to work without genomes [86] but it is 
not easy to keep them synchronized with developments 
in the "parent" software. In this work, we chose to de- 
velop a standalone "genome-free" web application, called 
GDAV (Genome-Delinked Annotation Viewer). It was 
designed with flexibility in mind; it can handle any kind 
of sequence annotations and integrates with genome 
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browsers of closely related species via the DAS protocol 
[87]. The free open source code is available at https:// 
github.com/VectorBase/GDAV and the software was 
developed within the auspices of VectorBase [77] . 

The key to GDAV s flexibility is its use of three simple, 
open text file formats for loading data: FASTA for 
sequences, tab-delimited files for annotations and GFF3 
for genome alignments. Only the loading of one or more 
FASTA files is mandatory, thereafter zero or more anno- 
tation and alignment files may be loaded into GDAVs 
small MySQL schema using the supplied Perl scripts. 
The annotation file consists of rows of data identified by 
the sequence ID in the first column, and subsequent 
named columns providing arbitrary text annotations. 
The Java-based web interface is simple to deploy within 
a Java web server such as Apache Tomcat. The web 
interface, with its integrated search facility, treats all 
annotations as plain text— no special treatment of nu- 
meric data (e.g. range queries or unit conversions) is 
provided. Link-outs to third party databases from spe- 
cific columns containing suitable IDs are possible 
through the configuration file. A Java-based DAS server 
based on Dazzle [88] is bundled with GDAV. It can be 
used to display GFF3 file-derived gapped alignments (e. 
g. exon-intron structure) of the sequences stored in 
GDAV with respect to the genomes of one or more 
closely related species. Any alignment features shown 
via DAS in genome browsers link back to the sequence 
report page in GDAV. 

In this study, the annotation files loaded into the sys- 
tem include the InterPro domain assignments and the 
BLASTX results providing "best hits" to several other 
proteomes. The GFF3-format exonerate alignments 
described above were also loaded into the system. 

Additional File 



Additional file 1. Immunity related putative ortholog genes^. 
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